Language-specific encoding in endangered language corpora

Authors

Frank Seifart

Geoffrey Haig

Nikolaus P. Himmelmann

Dagmar Jung

Anna Margetts

Jost Gippert

Abstract

The paper addresses problems of corpus building and retrieval that result from code-switching, a characteristic feature of endangered language recordings. The typical appearance of code-switching phenomena is first outlined on the basis of data collected in the DoBeS ‘ECLinG’ project, which dealt with three endangered Caucasian languages spoken in Georgia: Tsova-Tush (Batsbi), Udi, and Svan. The problem of language-specific retrieval is illustrated with examples of the word da in Tsova-Tush contexts, a homonym that represents either a native copula form (‘it is’) or the Georgian conjunction ‘and’. The subsequent section discusses the annotation required to distinguish automatically between the languages involved in code-switching, with a focus on the emerging ISO standard 639-6. It is argued that the fine-grained distinction of varieties and subvarieties and their interrelationship, as aimed at in this standard, requires a thorough reconsideration if it is to be applied in the markup of corpus data.
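The retrieval problem described in the abstract can be illustrated with a minimal Python sketch. It is not the markup scheme proposed in the paper: the `Token` class, the `find_form` helper, and the placeholder word forms are hypothetical, and the language tags `bbl` (Tsova-Tush/Bats) and `kat` (Georgian) are ISO 639-3 codes chosen here for illustration rather than the ISO 639-6 identifiers the paper discusses. Only the form da and its two glosses come from the abstract.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Token:
    form: str   # surface form as transcribed
    lang: str   # language tag attached during annotation (assumed: ISO 639-3)
    gloss: str  # free translation of the token


# Toy code-switched stretch. "w1".."w3" are placeholders, not real data;
# only "da" and its two readings are taken from the abstract.
corpus: List[Token] = [
    Token("w1", "bbl", "placeholder"),
    Token("da", "bbl", "it is"),   # native Tsova-Tush copula
    Token("w2", "kat", "placeholder"),
    Token("da", "kat", "and"),     # Georgian conjunction
    Token("w3", "kat", "placeholder"),
]


def find_form(tokens: List[Token], form: str,
              lang: Optional[str] = None) -> List[Token]:
    """Return tokens matching `form`, optionally restricted to one language."""
    return [t for t in tokens
            if t.form == form and (lang is None or t.lang == lang)]


# Without language tags, a search for "da" conflates both homonyms.
all_hits = find_form(corpus, "da")

# With per-token language tags, each reading can be retrieved separately.
copula_hits = find_form(corpus, "da", lang="bbl")
conjunction_hits = find_form(corpus, "da", lang="kat")
```

The sketch tags every token individually. Whether language should be marked per token, per phrase, or per utterance is exactly the kind of annotation decision the paper raises, so this granularity is an assumption of the example.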


Similar resources

Endangered Language Documentation: Bootstrapping a Chatino Speech Corpus, Forced Aligner, ASR

This project approaches the problem of language documentation and revitalization from a rather untraditional angle. To improve and facilitate language documentation of endangered languages, we attempt to use corpus linguistic methods and speech and language technologies to reduce the time needed for transcription and annotation of audio and video language recordings. The paper demonstrates this...


Language-specific encoding in multilingual corpora: Requirements and solutions

This is a special internet edition of the article “Language-specific encoding in multilingual corpora: Requirements and solutions” by Jost Gippert (1999). It should not be cited. Citations should be taken from the original edition in Multilinguale Corpora: Codierung, Strukturierung, Analyse. 11. Jahrestagung der Gesellschaft für Linguistische Datenverarbeitung (ed. J. Gippert / P. Olivier), Praha 1999, ...


Instant Annotations – Applying NLP Methods to the Annotation of Spoken Language Documentation Corpora

The paper describes work-in-progress by the Pite Saami, Kola Saami and Izhva Komi language documentation projects, all of which use similar data and technical frameworks and are carried out in Freiburg and in collaboration with Hamburg, Syktyvkar, Tromsø and Uppsala. Our projects work in the endangered language documentation framework and record new spoken language data, digitize available recor...


Corpus Encoding Standard: SGML Guidelines for Encoding Linguistic Corpora

The Corpus Encoding Standard (CES) is an application of SGML (ISO 8879:1986, Information Processing--Text and Office Systems--Standard Generalized Markup Language), conformant to the TEI Guidelines for Electronic Text Encoding and Interchange (Sperberg-McQueen and Burnard, 1994). It provides encoding conventions for linguistic corpora designed to be optimally suited for use in language engineer...


Endangered languages documentation: from standardization to mobilization

Currently, the main arena for computer-based linguistic contribution towards endangered languages is in data encoding and standardization. This phase urgently needs to be complemented by a period of working out how to deliver computer-based language support to endangered language communities. Established linguistic practice has neither sufficiently documented nor strengthened endangered languag...




Publication date: 2012
